
Motivation

Sports analytics: the collection of relevant historical statistics that can provide a competitive advantage to a team or individual.

In this case study, we will analyze data from the NBA

In basketball, there are typically 5 positions for players: point guard, shooting guard, small forward, power forward, and center

Hence, NBA players should fall into these 5 categories

Some questions will be answered, based on data of the all-time leaders from here: http://stats.nba.com/alltime-leaders/

# delete everything
rm(list=ls()) 

library(tidyverse)
library(GGally) # ggplot2-based visualization of correlations
library(factoextra) # ggplot2-based visualization of pca
library(cluster)

Load and prepare data

The dataset contains the skill performance of the most important NBA players in history

See the glossary here: http://stats.nba.com/alltime-leaders/

Performance is per game

historical_players.df = read.csv(file = "nba.csv", header=T, sep=";")

glimpse(historical_players.df)
## Rows: 1,214
## Columns: 26
## $ player_id   <int> 893, 76375, 76127, 201142, 2544, 78497, 947, 77847, 76804,…
## $ player_name <chr> "Michael Jordan", "Wilt Chamberlain", "Elgin Baylor", "Kev…
## $ gp          <int> 1072, 1045, 846, 703, 1061, 932, 914, 792, 791, 1040, 1476…
## $ min         <dbl> 38.25653, 45.79809, 40.02719, 37.37980, 38.90009, 39.23927…
## $ fgm         <dbl> 11.37313, 12.13493, 10.27541, 9.19346, 9.82375, 9.67382, 9…
## $ fga         <dbl> 22.88899, 22.48517, 23.84279, 18.85349, 19.60697, 20.42060…
## $ fg_pct      <dbl> 0.497, 0.540, 0.431, 0.488, 0.501, 0.474, 0.425, 0.436, 0.…
## $ fg3m        <dbl> 0.54198, NA, NA, 1.79232, 1.38266, NA, 1.15864, NA, 0.0973…
## $ fg3a        <dbl> 1.65858, NA, NA, 4.72404, 4.04807, NA, 3.70131, NA, 0.3274…
## $ fg3_pct     <dbl> 0.327, NA, NA, 0.379, 0.342, NA, 0.313, NA, 0.297, NA, 0.2…
## $ ftm         <dbl> 6.83489, 5.79617, 6.81206, 7.01991, 6.10179, 7.68240, 6.97…
## $ fta         <dbl> 8.18284, 11.35120, 8.73641, 7.96017, 8.24882, 9.44313, 8.9…
## $ ft_pct      <dbl> 0.835, 0.511, 0.780, 0.882, 0.740, 0.814, 0.780, 0.761, 0.…
## $ oreb        <dbl> 1.55597, NA, NA, 0.78663, 1.21489, 0.96774, 0.81510, NA, 1…
## $ dreb        <dbl> 4.66791, NA, NA, 6.36984, 6.04807, 2.77419, 2.89825, NA, 3…
## $ reb         <dbl> 6.22388, 22.89378, 13.54965, 7.15647, 7.26296, 5.76824, 3.…
## $ ast         <dbl> 5.25466, 4.44306, 4.31442, 3.78805, 7.03205, 6.69313, 6.15…
## $ stl         <dbl> 2.34515, NA, NA, 1.19488, 1.64844, 2.61290, 2.16958, NA, 1…
## $ blk         <dbl> 0.83302, NA, NA, 1.04979, 0.77003, 0.74194, 0.17943, NA, 0…
## $ tov         <dbl> 2.72761, NA, NA, 3.15932, 3.41093, NA, 3.56893, NA, 3.0141…
## $ pf          <dbl> 2.59608, 1.98565, 3.06856, 1.89189, 1.86334, 2.61266, 1.94…
## $ pts         <dbl> 30.12313, 30.06603, 27.36288, 27.19915, 27.13195, 27.03004…
## $ ast_tov     <dbl> 1.92647, NA, NA, 1.19901, 2.06162, NA, 1.72410, NA, 0.9246…
## $ stl_tov     <dbl> 0.85978, NA, NA, 0.37821, 0.48328, NA, 0.60791, NA, 0.3912…
## $ efg_pct     <dbl> 0.50872, 0.53969, 0.43097, 0.53516, 0.53629, 0.47373, 0.45…
## $ ts_pct      <dbl> 0.56859, 0.54706, 0.49415, 0.60832, 0.58382, 0.54994, 0.51…

These are the 1,214 best players in NBA history

Missing values

hist(rowMeans(is.na(historical_players.df)))

barplot(colMeans(is.na(historical_players.df)), las=2)

# skip players with NA information in any variable
historical_players.df <-
  historical_players.df[rowSums(is.na(historical_players.df)) == 0,]
dim(historical_players.df)
## [1] 1002   26

Conclusions? What can we do?
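One alternative to dropping those rows: the NAs are era-related (three-point stats were not recorded before the 1979-80 season, for instance), so imputing 0 in the made/attempted columns can be defensible. A minimal sketch on a toy data frame (illustrative, not applied to `nba.csv` here):

```r
# Hypothetical alternative: impute era-related NAs with 0 instead of dropping rows
# (toy data frame whose column names mimic the real ones)
toy <- data.frame(player = c("A", "B"),
                  fg3m = c(0.54, NA), fg3a = c(1.66, NA))
for (v in c("fg3m", "fg3a")) toy[[v]][is.na(toy[[v]])] <- 0
colSums(is.na(toy))  # no NAs remain
```

Note this choice injects an assumption (those players would have made zero threes), which would pull the three-point variables of early-era players toward 0 in the clustering.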

Create the data frame

nba <- historical_players.df[,3:ncol(historical_players.df)]
nba <- as.data.frame(sapply(nba, as.numeric ))
names = historical_players.df[,2]

dim(nba)
## [1] 1002   24
summary(nba)
##        gp              min              fgm               fga        
##  Min.   : 400.0   Min.   : 8.947   Min.   : 0.6232   Min.   : 1.531  
##  1st Qu.: 532.5   1st Qu.:20.489   1st Qu.: 2.7673   1st Qu.: 5.963  
##  Median : 682.5   Median :25.101   Median : 3.7753   Median : 8.220  
##  Mean   : 718.5   Mean   :25.163   Mean   : 4.0994   Mean   : 8.819  
##  3rd Qu.: 861.0   3rd Qu.:29.802   3rd Qu.: 5.2043   3rd Qu.:11.172  
##  Max.   :1611.0   Max.   :41.120   Max.   :11.3731   Max.   :22.889  
##      fg_pct            fg3m               fg3a            fg3_pct      
##  Min.   :0.3730   Min.   :0.000000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.4370   1st Qu.:0.006258   1st Qu.:0.04473   1st Qu.:0.1520  
##  Median :0.4610   Median :0.137780   Median :0.47784   Median :0.2945  
##  Mean   :0.4645   Mean   :0.430171   Mean   :1.22112   Mean   :0.2492  
##  3rd Qu.:0.4900   3rd Qu.:0.793380   3rd Qu.:2.20637   3rd Qu.:0.3510  
##  Max.   :0.6770   Max.   :3.339720   Max.   :7.62892   Max.   :1.0000  
##       ftm              fta             ft_pct            oreb       
##  Min.   :0.2141   Min.   :0.2441   Min.   :0.4140   Min.   :0.1456  
##  1st Qu.:1.1963   1st Qu.:1.6486   1st Qu.:0.7060   1st Qu.:0.6621  
##  Median :1.7713   Median :2.4028   Median :0.7630   Median :1.1722  
##  Mean   :2.0709   Mean   :2.7368   Mean   :0.7492   Mean   :1.3260  
##  3rd Qu.:2.6476   3rd Qu.:3.5213   3rd Qu.:0.8060   3rd Qu.:1.8559  
##  Max.   :7.1539   Max.   :9.3223   Max.   :0.9050   Max.   :5.0647  
##       dreb             reb              ast               stl        
##  Min.   :0.7673   Min.   : 1.007   Min.   : 0.1975   Min.   :0.1225  
##  1st Qu.:2.0071   1st Qu.: 2.781   1st Qu.: 1.1235   1st Qu.:0.5405  
##  Median :2.8147   Median : 4.031   Median : 1.9202   Median :0.7761  
##  Mean   :3.1508   Mean   : 4.492   Mean   : 2.4371   Mean   :0.8537  
##  3rd Qu.:3.9814   3rd Qu.: 5.833   3rd Qu.: 3.3430   3rd Qu.:1.0794  
##  Max.   :9.7748   Max.   :13.993   Max.   :11.1932   Max.   :2.7112  
##       blk               tov               pf              pts        
##  Min.   :0.00996   Min.   :0.1734   Min.   :0.7944   Min.   : 1.729  
##  1st Qu.:0.18856   1st Qu.:1.0697   1st Qu.:1.9307   1st Qu.: 7.144  
##  Median :0.34270   Median :1.5038   Median :2.2904   Median : 9.769  
##  Mean   :0.52666   Mean   :1.5942   Mean   :2.3360   Mean   :10.700  
##  3rd Qu.:0.69759   3rd Qu.:2.0263   3rd Qu.:2.7452   3rd Qu.:13.634  
##  Max.   :3.50171   Max.   :3.9222   Max.   :3.8665   Max.   :30.123  
##     ast_tov          stl_tov          efg_pct           ts_pct      
##  Min.   :0.2445   Min.   :0.1320   Min.   :0.4069   Min.   :0.4294  
##  1st Qu.:0.9114   1st Qu.:0.4061   1st Qu.:0.4675   1st Qu.:0.5105  
##  Median :1.3036   Median :0.5208   Median :0.4866   Median :0.5295  
##  Mean   :1.4672   Mean   :0.5600   Mean   :0.4885   Mean   :0.5307  
##  3rd Qu.:1.9264   3rd Qu.:0.6748   3rd Qu.:0.5066   3rd Qu.:0.5504  
##  Max.   :4.6936   Max.   :1.9345   Max.   :0.6771   Max.   :0.6433

Around 200 players (212, exactly: 1,214 − 1,002) were deleted.

Some feature engineering

The following is optional, but makes sense for a coach

That is, we will use variables that depend on a player’s skills, not on the coach’s decisions

# skip variable gp (games played) which is decided by the coach
gp = nba$gp
nba$gp = NULL

# skip the min variable (minutes played per game) which is decided by the coach
min = nba$min
nba$min = NULL

Take care: there are some redundant variables, for instance

  • fgm, fga, and fg_pct
  • fg3m, fg3a, and fg3_pct
  • ftm, fta, and ft_pct
  • oreb, dreb, and reb

Should we skip some of the previous variables?

# drop the derived/redundant variables:
# fg_pct, fg3_pct, ft_pct, reb, ast_tov, stl_tov, efg_pct, ts_pct
nba = nba[,-c(3,6,9,12,19,20,21,22)]
dim(nba)
## [1] 1002   14

We could also remove the variable pts, because it is an exact linear combination of the shooting variables: pts = 2·fgm + fg3m + ftm (note fgm counts both two- and three-point field goals made). Here we keep it.
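As a quick numerical check that \(pts = 2\,fgm + fg3m + ftm\) (fgm already includes three-pointers), take Michael Jordan’s per-game averages from the glimpse output:

```r
# Verify pts = 2*fgm + fg3m + ftm with Michael Jordan's per-game averages
fgm <- 11.37313; fg3m <- 0.54198; ftm <- 6.83489; pts <- 30.12313
stopifnot(abs(2*fgm + fg3m + ftm - pts) < 1e-5)  # holds exactly
```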

Descriptive Analysis

Our input has dimension \(p=14\), which implies \(2^p\) possible combinations of variables: too many to inspect one by one.

Remember PCA interpretation and output

pca = prcomp(nba, scale=T)

# Plot the first two scores
data.frame(z1=-pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=names)) + geom_point(size=0) +
  labs(title="PCA", x="PC1", y="PC2") +
  theme_bw() +theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE) 

We will use this representation when plotting our clusters. That is, the position of the players will be the same, but we will add colors for different clusters. This will be called the clusplot.

Clustering

kmeans

Let’s start by creating 5 clusters (the typical number of playing positions) using kmeans

X = scale(nba)

fit = kmeans(X, centers=5, nstart=100)
groups = fit$cluster
groups
##    [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 4 4 5 4 4 4 4 4 4 4 4 4 4 5 4 4 4 4 4 5
##   [38] 4 4 4 4 4 4 4 4 5 5 5 4 5 5 4 4 5 4 5 5 4 5 5 4 5 4 4 4 5 4 5 5 5 5 5 4 4
##   [75] 4 4 5 4 5 5 5 5 5 5 4 4 4 5 5 5 4 5 4 4 5 4 5 4 5 4 4 4 5 4 5 4 2 5 5 5 4
##  [112] 4 5 5 4 5 4 5 5 4 5 4 5 4 5 4 5 4 5 4 5 5 4 2 5 5 5 2 5 5 5 4 5 5 5 5 2 5
##  [149] 5 5 2 4 5 4 4 5 5 5 5 5 5 2 4 5 5 4 5 4 5 5 2 5 5 2 5 4 2 5 5 5 5 5 2 5 5
##  [186] 4 5 5 2 5 5 4 5 5 5 5 5 4 5 4 5 5 2 2 2 5 5 2 2 5 5 5 2 5 5 5 2 5 5 5 5 2
##  [223] 5 2 5 5 5 5 5 5 4 2 5 2 2 5 2 2 5 5 5 5 2 5 5 5 5 5 2 2 5 2 5 5 2 2 5 2 2
##  [260] 5 5 5 5 2 5 5 5 2 2 5 5 2 2 5 2 1 5 2 5 5 5 2 5 2 5 5 2 2 2 1 2 2 2 5 2 5
##  [297] 5 5 2 5 2 5 5 2 5 2 2 5 5 5 5 2 5 5 5 2 2 2 2 5 5 5 5 5 2 2 1 2 2 5 5 2 2
##  [334] 2 1 2 2 5 2 2 1 5 2 2 1 1 2 2 1 1 2 2 5 5 2 2 2 1 1 2 2 2 5 1 2 1 5 5 5 5
##  [371] 2 1 2 2 2 2 1 1 2 1 2 2 2 5 5 5 2 5 1 2 1 5 1 2 2 2 1 5 1 2 2 5 2 1 5 5 1
##  [408] 5 5 1 5 1 2 2 1 5 1 2 5 1 1 2 2 2 2 2 1 5 2 2 2 5 1 1 2 2 2 2 1 1 5 2 5 1
##  [445] 1 1 2 1 5 2 1 1 1 2 5 2 1 1 2 1 2 2 2 1 1 2 2 2 2 1 2 5 2 2 2 1 2 2 1 5 1
##  [482] 2 2 2 2 2 1 3 1 1 1 1 1 1 1 1 2 1 1 1 1 1 3 1 1 1 2 2 2 1 1 5 1 2 3 1 2 1
##  [519] 1 1 1 2 2 1 2 1 1 2 1 2 3 5 2 2 2 2 1 2 1 1 1 1 1 2 3 1 1 1 1 2 1 3 2 3 2
##  [556] 1 1 1 2 2 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2 2 1 2 2 1 1 1 2 3 1 1 1 2 3 1
##  [593] 1 3 2 3 3 2 3 1 1 1 1 1 1 1 2 1 1 1 1 3 3 5 1 1 1 1 2 1 2 3 3 1 1 1 3 1 1
##  [630] 1 1 2 1 1 1 2 1 1 1 3 2 1 1 3 1 1 3 3 1 2 2 1 1 2 3 2 1 1 1 3 3 1 1 3 2 2
##  [667] 3 3 2 1 1 1 2 3 1 3 1 2 1 1 1 1 1 3 3 3 3 3 1 1 3 2 1 1 3 2 1 1 3 1 2 3 1
##  [704] 1 1 3 1 1 1 1 1 3 3 1 3 1 1 1 3 3 1 2 1 1 2 1 3 2 3 2 1 1 2 3 3 3 1 2 1 3
##  [741] 1 1 3 2 1 3 1 1 1 2 2 1 3 3 1 1 3 3 3 1 3 3 3 1 1 3 1 1 3 3 1 1 1 3 3 1 1
##  [778] 1 1 1 3 3 3 3 1 3 3 3 3 1 3 3 3 3 2 1 3 1 3 1 2 1 3 1 3 1 1 3 3 1 1 1 1 1
##  [815] 3 3 1 3 1 3 1 3 1 3 1 1 1 3 1 1 3 1 3 3 1 3 3 3 2 1 3 3 3 3 3 1 3 3 3 3 3
##  [852] 1 2 3 3 1 1 3 1 3 1 3 3 3 3 3 3 2 1 3 3 3 3 3 1 3 1 1 3 1 3 1 3 1 3 3 3 3
##  [889] 3 3 3 3 3 3 3 3 3 3 1 1 3 3 1 3 3 1 3 1 1 3 3 1 3 3 3 3 3 3 3 1 3 3 3 3 1
##  [926] 3 3 2 1 1 3 3 3 1 3 3 3 3 1 3 3 1 3 3 1 3 3 1 3 1 3 3 1 3 3 3 3 3 3 3 3 3
##  [963] 3 3 3 3 3 1 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3
## [1000] 3 3 3

The output is one category for each player. This is the usual output of hard (deterministic) clustering tools

Are the groups well balanced?

barplot(table(groups), col="blue")

Insights?

Interpretation

Let’s understand the meaning of the centers (artificial players representing each group)

centers=fit$centers
centers
##          fgm         fga       fg3m       fg3a         ftm        fta
## 1 -0.5408269 -0.44447906  0.4682255  0.4611711 -0.59818319 -0.6710123
## 2  0.1431751  0.02855033 -0.6049455 -0.6160810  0.09779051  0.2061365
## 3 -0.9698070 -1.04295119 -0.7000063 -0.7264866 -0.76539380 -0.6948018
## 4  1.8784138  1.73629737 -0.2964663 -0.2682624  1.98126677  2.0461817
## 5  0.7750681  0.90138820  0.8515529  0.8877014  0.63289659  0.5186647
##         oreb       dreb         ast         stl        blk        tov
## 1 -0.8781773 -0.7647460  0.03807134 -0.04296428 -0.6252656 -0.5197854
## 2  1.0378050  0.9235593 -0.43232608 -0.23128498  0.7148681  0.1430987
## 3  0.2102893 -0.2276999 -0.86271930 -0.85614815  0.1807596 -0.9330711
## 4  1.1369092  1.4867162  0.44702775  0.63465432  0.9245510  1.5438121
## 5 -0.5854997 -0.3150530  1.09838679  0.91595742 -0.4780315  0.8585740
##            pf         pts
## 1 -0.82443066 -0.50551963
## 2  0.88547528  0.06391052
## 3 -0.05467881 -1.00377044
## 4  0.93418554  1.88247739
## 5 -0.13194480  0.84073737

Who are the players in the first group? And in the second one?

i=1  # plotting the centers in cluster 1
bar1=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar1,y=apply(X, 2, quantile, 0.50),col="red",pch=19)

The blue bars represent the center of a group, whereas the red points represent the overall center of all the NBA players (the per-variable median of the standardized data)

i=2  # plotting the centers in cluster 2
bar2=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar2,y=apply(X, 2, quantile, 0.50),col="red",pch=19)

The clusplot

fviz_cluster(fit, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
  theme_minimal()+geom_text(label=names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Paired")

The Silhouette plot

The silhouette value, in \([-1,1]\), measures how similar a data point is to its own cluster (cohesion) relative to the other clusters (separation).

Silhouette plots rely on a distance metric; a high value indicates that a point matches its own cluster well and its neighboring clusters poorly.

The larger the silhouette widths, the better.
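Concretely, this is the standard silhouette definition: for each point \(i\), let \(a(i)\) be the average distance from \(i\) to the other points in its own cluster, and \(b(i)\) the average distance from \(i\) to the points of the nearest other cluster. Then

\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
\]

so \(s(i)\) is close to 1 when \(i\) is much closer to its own cluster than to any other, and negative when it sits closer to a neighboring cluster.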

d <- dist(X, method="euclidean")  
sil = silhouette(groups, d)
plot(sil, col=1:5, main="", border=NA)

summary(sil)
## Silhouette of 1002 units in 5 clusters from silhouette.default(x = groups, dist = d) :
##  Cluster sizes and average silhouette widths:
##       278       205       220        93       206 
## 0.1965663 0.1813903 0.3318040 0.1220498 0.1440822 
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1453  0.1071  0.2125  0.2054  0.3124  0.5298

How many groups?

Let’s see what happens with two groups

fit = kmeans(X, centers=2, nstart=100)
groups = fit$cluster
groups
##    [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##   [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##   [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
##  [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
##  [297] 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2
##  [334] 2 1 1 2 2 2 2 1 2 1 2 1 1 1 2 1 1 2 2 2 2 2 1 2 1 1 2 2 2 2 1 1 1 2 2 2 2
##  [371] 2 1 1 2 2 1 1 1 2 1 1 2 1 1 1 2 1 2 1 1 1 2 1 2 1 1 1 2 1 1 1 2 1 1 2 1 1
##  [408] 2 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1 1 2 2 1 2 2 1 1 1 1 2 1 1 1 2 1 2 1
##  [445] 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
##  [482] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1
##  [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [556] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [593] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [630] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [667] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [704] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [741] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [778] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [815] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [852] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [889] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [926] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [963] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1000] 1 1 1
barplot(table(groups), col="blue")

centers=fit$centers

i=1  # plotting the centers in cluster 1
bar1=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar1,y=apply(X, 2, quantile, 0.50),col="red",pch=19)

i=2  # plotting the centers in cluster 2
bar2=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar2,y=apply(X, 2, quantile, 0.50),col="red",pch=19)

Now the interpretation is easier, isn’t it?

Clusplot

fviz_cluster(fit, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
  theme_minimal()+geom_text(label=names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Paired")

Silhouette plot

d <- dist(X, method="euclidean")  
sil = silhouette(groups,d)
plot(sil, col=1:2, main="", border=NA)

summary(sil)
## Silhouette of 1002 units in 2 clusters from silhouette.default(x = groups, dist = d) :
##  Cluster sizes and average silhouette widths:
##       625       377 
## 0.3799295 0.1496988 
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1292  0.1801  0.3036  0.2933  0.4434  0.5507

In terms of the silhouette plot, 2 clusters seem better than 5: the average width is higher (0.29 vs. 0.21)

Hence, how many clusters?

In theory, we cannot answer this question. In practice, we can get some hints using screeplots:

Based on wss (total within sum of square)

fviz_nbclust(X, kmeans, method = 'wss')
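For intuition, this screeplot can be computed by hand: run kmeans for each candidate \(k\) and record `tot.withinss`. A minimal sketch on synthetic data (the `toy` matrix is illustrative, not the NBA data):

```r
# Manual wss elbow: total within-cluster sum of squares for k = 1..10
set.seed(1)
toy <- matrix(rnorm(200 * 4), ncol = 4)
wss <- sapply(1:10, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "number of clusters k", ylab = "total wss")
```

The curve always decreases with \(k\); the heuristic is to look for the “elbow” where additional clusters stop paying off.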

Based on the average silhouette width

fviz_nbclust(X, kmeans, method = 'silhouette')

Based on the gap statistic (using bootstrap)

fviz_nbclust(X, kmeans, method = 'gap_stat', k.max = 20)

Insights?

Profile variables

We have some variables not included in the clustering. These are called profile variables and may help us understand the clusters better.

Let’s consider 3 clusters

fit = kmeans(X, centers=3, nstart=100)
groups = fit$cluster

Let’s see the distribution of the clusters considering the minutes played

as.data.frame(X) %>% mutate(cluster=factor(groups), names=names, min=min, gp=gp) %>%
  ggplot(aes(x = cluster, y = min)) + 
  geom_boxplot(fill="lightblue") +
  labs(title = "Minutes played by cluster", x = "", y = "", col = "") 

It seems players in one cluster play fewer minutes per game. This cluster appears to contain the weakest players

What about the other two clusters?

as.data.frame(X) %>% mutate(cluster=factor(groups), names=names, min=min, gp=gp) %>%
  ggplot(aes(x = cluster, y = gp)) + 
  geom_boxplot(fill="lightblue") +
  labs(title = "Games played by cluster", x = "", y = "", col = "") 

Similar insights: one cluster has less experience

Clusplot

fviz_cluster(fit, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
  theme_minimal()+geom_text(label=names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Paired")

The clusplot is clearer: one cluster contains the worst players, and the best players are divided into two groups: interior and exterior players.

kmeans with Mahalanobis distance

By default, kmeans uses the Euclidean distance, which favors spherical clusters and ignores the correlations between variables

But how can we incorporate these correlations, i.e., allow elliptical clusters?

We can use the Mahalanobis distance, i.e., a multivariate standardization:

S_x <- cov(nba)                            # sample covariance matrix
iS <- solve(S_x)                           # its inverse
e <- eigen(iS)
V <- e$vectors
B <- V %*% diag(sqrt(e$values)) %*% t(V)   # B = S_x^(-1/2)
Xtil <- scale(nba, scale = FALSE)          # center the variables
nbaS <- Xtil %*% B                         # Mahalanobis-standardized (whitened) data
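The matrix B above is \(S_x^{-1/2}\), so nbaS is whitened: its covariance is the identity, and Euclidean distances on nbaS equal Mahalanobis distances on the original data. A self-contained check on synthetic correlated data (not the NBA data):

```r
# Whitening check: after the S^(-1/2) transform, cov() is the identity
set.seed(1)
Z <- matrix(rnorm(500 * 3), ncol = 3) %*% matrix(c(2, 1, 0, 1, 2, 1, 0, 1, 2), 3, 3)
e <- eigen(solve(cov(Z)))
B <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
Ztil <- scale(Z, scale = FALSE) %*% B
max(abs(cov(Ztil) - diag(3)))  # numerically zero
```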

Take care: here we are assuming all the clusters have the same covariance matrix, that is, the same shape and orientation as the whole dataset

If this is not the case, we need mixture models (probabilistic clusters)

kmeans with 3 clusters assuming an elliptic distribution

fit.mahalanobis = kmeans(nbaS, centers=3, nstart=100)
groups = fit.mahalanobis$cluster
centers=fit.mahalanobis$centers
colnames(centers)=colnames(X)
centers
##           fgm         fga       fg3m       fg3a         ftm         fta
## 1  0.03606579 -0.07584307 -0.2188226 -0.5457885  0.06222957  0.01504068
## 2 -0.09370113  0.20135432  0.4021693  1.1715184 -0.09005059 -0.10119227
## 3  0.06876773 -0.15472148 -0.0270122 -0.4624777 -0.04936909  0.17439345
##         oreb        dreb         ast          stl         blk          tov
## 1  0.0726238 -0.11726620  0.04059037  0.070661922 -0.39253699 -0.025349309
## 2 -0.2774607 -0.08787219  0.04525767  0.007220461 -0.09184802  0.004275862
## 3  0.3467390  0.69724359 -0.27615960 -0.312868228  1.85939616  0.096131549
##            pf          pts
## 1  0.06594311 -0.089565553
## 2 -0.14304438  0.159091306
## 3  0.05939500  0.001891169
i=1  # plotting the centers in cluster 1
bar1=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar1,y=apply(X, 2, quantile, 0.50),col="red",pch=19)

i=2  # plotting the centers in cluster 2
bar2=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar2,y=apply(X, 2, quantile, 0.50),col="red",pch=19)

i=3  # plotting the centers in cluster 3
bar3=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar3,y=apply(X, 2, quantile, 0.50),col="red",pch=19)

clusplot

fviz_cluster(fit.mahalanobis, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
  theme_minimal()+geom_text(label=names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Paired")

Different insights now…

How similar are the clusters?

The Adjusted Rand index compares two classifications: the closer to 1 the more agreement
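As a quick sanity check of how the index behaves, note it compares partitions, not labels: relabeling the clusters does not change it (toy labelings; assumes the mclust package is installed).

```r
library(mclust)
a <- c(1, 1, 2, 2, 3, 3)
b <- c(2, 2, 3, 3, 1, 1)  # same partition as a, with the clusters renumbered
adjustedRandIndex(a, b)   # 1: perfect agreement despite different labels
```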

library(mclust)
adjustedRandIndex(fit$cluster, fit.mahalanobis$cluster) 
## [1] 0.09788451

Hence, the Euclidean distance gives very different clusters from the Mahalanobis one

Which clustering do you prefer?